Apache Arrow

Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware. This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.
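The idea of a column-oriented memory layout can be illustrated with a minimal sketch. It uses only Python's standard library and is not Arrow's actual format or API; the record fields (`id`, `price`) and the helper `to_columns` are hypothetical names chosen for illustration. Each column is stored as one contiguous typed buffer, so a scan over a single column does not touch the others.

```python
from array import array

# Hypothetical row-oriented records: one tuple per row (id, price).
rows = [(1, 9.5), (2, 12.0), (3, 7.25)]


def to_columns(records):
    """Transpose row tuples into one contiguous typed buffer per column."""
    ids = array("q", (r[0] for r in records))     # 64-bit signed integers
    prices = array("d", (r[1] for r in records))  # 64-bit floats
    return {"id": ids, "price": prices}


columns = to_columns(rows)

# An analytic scan reads only the buffer of the column it needs.
total_price = sum(columns["price"])
```

Here `total_price` is 28.75. A row-oriented layout would have to step over every `id` value to compute the same sum.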


Interoperability

Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas and other data processing libraries. The project includes native software libraries written in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Arrow allows for zero-copy reads and fast data access and interchange without serialization overhead between these languages and systems.
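What "zero-copy" means can be shown with a short sketch based on Python's built-in buffer protocol. This is an analogy only, assuming a plain `array` stands in for a column's value buffer; it does not use Arrow's libraries. A `memoryview` gives a second consumer access to the same bytes, whereas slicing the array itself produces an independent copy.

```python
from array import array

# A contiguous buffer of 64-bit integers standing in for a column's values.
values = array("q", [10, 20, 30, 40, 50])

# memoryview exposes the same memory through the buffer protocol; no copy.
view = memoryview(values)

# Slicing a memoryview yields another view over the same bytes.
window = view[1:4]

# A write to the original buffer is visible through the view,
# which shows that the two share memory.
values[2] = 99
shared = window.tolist()

# By contrast, slicing the array itself makes an independent copy.
copied = values[1:4]
values[2] = 30
copy_unchanged = copied.tolist()
```

After running this, `shared` is `[20, 99, 40]` because the view saw the write, and `copy_unchanged` is still `[20, 99, 40]` even though the original buffer was changed back to 30.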


Applications

Arrow has been used in diverse domains, including analytics, genomics, and cloud computing.


Comparison to Apache Parquet and ORC

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in-memory. The hardware resource engineering trade-offs for in-memory processing vary from those associated with on-disk storage. The Arrow and Parquet projects include libraries that allow for reading and writing data between the two formats.


Governance

Apache Arrow was announced by The Apache Software Foundation on February 17, 2016, with development led by a coalition of developers from other open source data analytics projects. The initial codebase and Java library were seeded by code from Apache Drill.


External links


Apache Arrow project web site
Apache Arrow GitHub project source code